Contents

0.1 Introduction

Tumorigenesis is a stepwise process that is driven by a sequence of molecular changes forming pathways of cancer progression. Conjunctive Bayesian Networks are probabilistic-graphical models designed for the analysis and modeling of these pathways [1]. CBN models have evolved into different varieties such as CT-CBN [2], H-CBN [3], B-CBN [4] and R-CBN [5] each addressing different aspects of this task. However, the software corresponding to these methods are not well-integrated as they are implemented in different languages with heterogeneous input and output formats. This necessitates a unifying platform that integrates these models and enables standardization of the input and output formats to facilitate the downstream pathway analysis and modeling. Evam-tools [6] is an R package, which has taken the initial steps towards this end. However, it partially serves this purpose, as it does not include the B-CBN model and the recently developed R-CBN algorithm, which focuses on the robust inference of cancer progression pathways [5]. Importantly, the B-CBN and R-CBN algorithms for pathway quantification require exhaustive consideration and weighting of all the potential dependency structures (posets) within mutational quartets. This entails re-implementation of the CBN models and adjustment of the downstream pathway analysis and modeling functions. Therefore, here we introduce CBN2Path R package that not only includes the original implementation of the CBN models (e.g. CT-CBN and H-CBN) in a unifying interface, but it also accommodates the necessary modifications to support the robust CBN algorithms (e.g. B-CBN and R-CBN). In summary, CBN2Path is an R package that supports robust quantification, analysis and visualization of cancer progression pathways from cross-sectional genomic data, and so we anticipate that it will be a widely-used package in the future.

0.2 Installation

The package can be installed as follows:

if (!requireNamespace("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("CBN2Path")
library(CBN2Path)

0.3 Cite our work

If you use the CBN2Path package, please cite the paper formally as follows:

William Choi-Kim and Sayed-Rzgar Hosseini. CBN2Path: an R/Bioconductor package for the analysis of cancer progression pathways using Conjunctive Bayesian Networks. F1000Research (In Submission).

0.4 The CT-CBN model

CBN2Path provides the R interface for the continuous-time CBN model (CT-CBN), which was originally implemented in C programming language [2].

CT-CBN needs the following three inputs:

  1. numMutations: The number of mutations to be considered.

  2. poset: The partially ordered set (poset), which is represented as a two-column matrix each row of which indicate that mutation in the first column must occur before the mutation in the second column.

  3. genotypeMatrix: a binary matrix with n rows and m columns. Each row corresponds to a given genotype in the sample. The first column must be always 1 and each of the other columns corresponds to a given mutation. Thus, m equals numMutations plus one.

Below, you can see an example on how these inputs are prepared to be used in the original implementation of the ctcbn model (ctcbn_single):

# The poset
DAG<-matrix(c(3,3,4,4,1,2,1,2),4,2)

# The genotype matrix
set.seed(100)
Gen1<-c(rep(0,150),sample(c(0,1),25,replace=TRUE),rep(0,25))
Gen2<-c(rep(0,175),sample(c(0,1),25,replace=TRUE))
Gen3<-c(rep(0,50),sample(c(0,1),100,replace=TRUE),rep(1,50))
Gen4<-c(sample(c(0,1),100,replace=TRUE),rep(0,50),rep(1,50))
gMat<-matrix(c(Gen1,Gen2,Gen3,Gen4),200,4)
gMat<-cbind(1,gMat)

# Preparing the inputs of the ct-cbn method
bc <- Spock$new(
     poset = DAG,
     numMutations = 4,
     genotypeMatrix =gMat
)

# Running the ct-cbn model
ResultsC<-ctcbn_single(bc)

Note that in the above example, we have generated a genotype matrix that perfectly matches the constraints specified in the given poset. In other words, in all genotypes in this sample of size 200, mutations 1 and 2 occur only if mutations 3 and 4 are already present.

It is important to note that in the above example, we have used the cbind function to make sure to have the first column always is 1 in all genotypes, and the other four columns respectively represent the mutations 1 to 4.

The maximum likelihood poset that the ct-cbn model outputs can be obtained as:

MLposet<-ResultsC[[1]]$poset$sets

The directed acyclic graph representation of the ML poset can be visualized using the visualize_cbn_model function as follows:

visualize_cbn_model(MLposet)
#> Using "sugiyama" as default layout

Furthermore, the Lambda parameters and the corresponding log likelihood can be obtained as:

MLlmbda<-ResultsC[[1]]$lambda
loglikelihood<-ResultsC[[1]]$summary[4]

You can repeat the above analysis using an imperfect genotype matrix (gMat_mut) for which some of the genotypes violate the pre-specified constraints in the mutational orders. For this purpose, we use the GenotypeMatrix_Mutator function, which subjects the original perfect genotype matrix (gMat) to false-positive and false-negative error rates of 0.3 and 0.2, respectively:

temp<-gMat[,2:5]
temp_mut<-GenotypeMatrix_Mutator(temp,0.3,0.2)
gMat_mut<-cbind(1,temp_mut)

and then you can rerun the ct-cbn model using the mutated genotype matrix as follows:

# The poset
DAG<-matrix(c(3,3,4,4,1,2,1,2),4,2)
# Preparing the inputs of the ct-cbn method
bc <- Spock$new(
     poset = DAG,
     numMutations = 4,
     genotypeMatrix =gMat_mut
)
# Running the ct-cbn model
ResultsC_mut<-ctcbn_single(bc)

You can check whether the ML poset now is different from the original one obtained using the perfect genotype matrix:

MLposet_mut<-ResultsC_mut[[1]]$poset$sets
visualize_cbn_model(MLposet_mut)
#> [1] "This is an empty poset, so no need for visualization."

As you can see in the message above that the ML poset is an empty poset now, meaning that after adding the errore, the model is no longer able to detect any restrictions between the mutations.

Note that if the posets and genotype data are stored in the original format needed in the C implementation of the CBN models, you can preprocess those files using read_poset and read_pattern functions in the CBN2Path package. You can see an example below, where the number of mutations is 10:

example_path <- get_examples()[1]
bc <- Spock$new(
     poset = read_poset(example_path)$sets,
     numMutations = read_poset(example_path)$mutations,
     genotypeMatrix = read_pattern(example_path)
)
ResultsC2<-ctcbn_single(bc)

Finally, you can obtain and store the results using a list of posets (instead of a single poset) as the input of the model using the ctcbn function. In the example below, we will consider all 219 unique (transitively-closed) DAGs as our list of posets that we use as the input of the ctcbn function:

Posets <- readRDS(system.file("extdata", "Posets.rds", package = "CBN2Path"))

bc <- Spock$new(
     poset = Posets,
     numMutations = 4,
     genotypeMatrix =gMat
)
ResultsC3<-ctcbn(bc)

You can see that the result is a list of 219 components each including the estimated parameters corresponding to one of the 219 posets in the input list. This strategy is utilized in the R-CBN and B-CBN models for quantifying the pathway probabilities.

You can obtain the log likelihood corresponding to each poset as follows:

LogLik<-numeric(219)
for (i in 1:219){
  LogLik[i]<-ResultsC3[[i]]$summary[4]
}

You can verify that the maximum-likelihood poset is the same as the poset that ctcbn_single outputs as the inferred poset (usig the error-free gMat genotype matrix).

INDX<-which.max(LogLik)
MLposet2<-Posets[[INDX]]
identical(MLposet2,MLposet)
#> [1] TRUE

0.5 The H-CBN model

The input/output structure of the H-CBN model (hcbn_single and hcbn) [3] is exactly the same as in the CT-CBN model (ctcbn_single and ctcbn) described above.

# The poset
DAG<-matrix(c(3,3,4,4,1,2,1,2),4,2)

# Preparing the inputs of the h-cbn method
bc <- Spock$new(
     poset = DAG,
     numMutations = 4,
     genotypeMatrix =gMat
)
# Running the h-cbn model
ResultsH<-hcbn_single(bc)

The Lambda parameters and the corresponding log likelihood can be obtained as:

MLlmbdaH<-ResultsH[[1]]$lambda
loglikelihoodH<-ResultsH[[1]]$summary[4]

You can also rerun the h-cbn model using the mutated genotype matrix as follows:

# The poset
DAG<-matrix(c(3,3,4,4,1,2,1,2),4,2)
# Preparing the inputs of the h-cbn method
bc <- Spock$new(
     poset = DAG,
     numMutations = 4,
     genotypeMatrix =gMat_mut
)
# Running the h-cbn model
ResultsH_mut<-hcbn_single(bc)

Again the poset and genotype files stored in the original formats can be processed using read_poset and read_pattern functions in the CBN2Path package.

example_path <- get_examples()[1]
bc <- Spock$new(
     poset = read_poset(example_path)$sets,
     numMutations = read_poset(example_path)$mutations,
     genotypeMatrix = read_pattern(example_path)
)
ResultsH2<-hcbn_single(bc)

Finally, you can obtain and store the results using a list of posets (instead of a single poset) as the input of the model using the hcbn function.

Posets <- readRDS(system.file("extdata", "Posets.rds", package = "CBN2Path"))

bc <- Spock$new(
     poset = Posets,
     numMutations = 4,
     genotypeMatrix =gMat
)
ResultsH3<-hcbn(bc)

You can obtain the log likelihood corresponding to each poset as follows:

LogLikH<-numeric(219)
for (i in 1:219){
  LogLikH[i]<-ResultsH3[[i]]$summary[4]
}

You can verify that the maximum-likelihood poset is the same as the poset that hcbn_single outputs as the inferred poset (usig the error-free gMat genotype matrix).

MLposetH<-ResultsH2[[1]]$poset$sets

INDX<-which.max(LogLikH)
MLposetH2<-Posets[[INDX]]
identical(MLposetH2,MLposetH)
#> [1] FALSE

0.6 Analysis of cancer progression pathways

One of the important feature of the CBN2Path package is its emphasis on mutational pathway analyses and modeling. In this section, we will work with a set of functions that enable quantification, analysis and visualization of the mutational pathways.

There are two approaches to quantify the pathway probabilities:

  1. The first approach is to use the output of the ct-cbn or h-cbn methods (namely the estimated lambda parameters and the inferred ML poset) as input of the PathProb_CBN function. This method is generic as it can be used for every number of mutations.

  2. The second approach works only for mutational quartets (n=4) and uses the genotypic data directly as the input. In this setting, each CBN model has its own pathway quantification functions: PathProb_Quartet_CTCBN, PathProb_Quartet_HCBN, PathProb_Quartet_RCBN, and PathProb_Quartet_BCBN.

As examples for the first approach, let’s use the ResultsC2 and ResultsH2 that we obtained in the previous section by learning respectively, ct-cbn and h-cbn models on genotypic data with n=10 mutations. First, we need to obtain the estimated lambda parameters and the inferred DAG and then use them as the input of the PathProb_CBN function as follows:

lambdaC<-as.numeric(ResultsC2[[1]]$lambda)
lambdaH<-as.numeric(ResultsH2[[1]]$lambda)
dagC<-ResultsC2[[1]]$poset$sets
dagH<-ResultsH2[[1]]$poset$sets

ProbC<-PathProb_CBN(dagC,lambdaC,10)
ProbH<-PathProb_CBN(dagH,lambdaH,10)

Note that in the above code, probabilities of 10!=3,628,800 pathways need to be calculated, which is extremely time-consuming, so we have not executed this chunk of the code.

Now, let’s try the second approach using both the gMat and gMat_mut genotype matrices. Note that in these functions, the number of columns in the input genotype matrix must always be four, and the first column does not need to be an all-one column. Therefore, we must first eliminate the first column from the gMat and gMat_mut matrices.

gMat2<-gMat[,2:5]
gMat2_mut<-gMat_mut[,2:5]

ProbC1<-PathProb_Quartet_CTCBN(gMat2)
ProbC2<-PathProb_Quartet_CTCBN(gMat2_mut)

ProbH1<-PathProb_Quartet_HCBN(gMat2)
ProbH2<-PathProb_Quartet_HCBN(gMat2_mut)

You can visualize the 24 pathways and their associated probabilities using visualize_probabilities function. In the figures below, you can compare the pathway probability distributions (quantified using CT-CBN model) before and after errors:

visualize_probabilities(ProbC1)

visualize_probabilities(ProbC2)

You can see before adding errors (ProbC1), only four pathways are feasible with non-zero probabilities, but in the presence of error (ProbC2) all pathways are feasible and so the probability is more uniformly distributed among the pathways.

Now, let’s assume that the four genes in consideration are: “KRAS”, “TP53”, “CDKN2A”, “RREB1”. We can now visualize the pathways with gene names and their probability distributions (ProbC2) as follows:

gene_names<-c("KRAS", "TP53", "CDKN2A", "RREB1")
visualize_probabilities(ProbC2,geneNames=gene_names)

Similarly, in the figures below, we can compare the pathway probability distributions (quantified using H-CBN model) before (ProbH1) and after errors (ProbH2):

visualize_probabilities(ProbH1)

visualize_probabilities(ProbH2)

We can see that under the H-CBN model, the pathway probabilities before and after errors stay more similar than what we observed under the CT-CBN model. We can formally quantify to what extent the two probability distribution differ using the Jensen-Shannon Divergence (JSD; implemented as Jensen_Shannon_Divergence function).

JSD_C<-Jensen_Shannon_Divergence(ProbC1,ProbC2)
JSD_H<-Jensen_Shannon_Divergence(ProbH1,ProbH2)
JSD_C
#> [1] 0.4141736
JSD_H
#> [1] 0.313736

We can also quantify the predictability for a given pathway probability distribution as described in [7] using the Predictability function. We can see that the predictability after errors drops substantially under CT-CBN:

Pred_C1<-Predictability(ProbC1,4)
Pred_C2<-Predictability(ProbC2,4)
Pred_C1
#> [1] 0.5662913
Pred_C2
#> [1] 0.06947874
Pred_C1-Pred_C2
#> [1] 0.4968126

In contrast, under H-CBN, the predictability after errors, even gets slightly higher than the one obtained in the error-free setting:

Pred_H1<-Predictability(ProbH1,4)
Pred_H2<-Predictability(ProbH2,4)
Pred_H1
#> [1] 0.5662912
Pred_H2
#> [1] 0.6972734
Pred_H1-Pred_H2
#> [1] -0.1309822

Finally, we can compute the pathway compatibility scores both for gMat2 and gMat2_mut genotype matrices using the Pathway_Compatibility_Quartet function as follows:

PathwayC1<-Pathway_Compatibility_Quartet(gMat2)
PathwayC2<-Pathway_Compatibility_Quartet(gMat2_mut)

The Spearman’s correlation coefficient between the pathway compatibility and the pathway probabilities quantified under CT-CBN or H-CBN can be quantified as follows:

RhoC1<-cor(PathwayC1,ProbC1,method="spearman")
RhoC2<-cor(PathwayC2,ProbC2,method="spearman")
RhoC1
#> [1] 0.6622063
RhoC2
#> [1] 0.956276
RhoH1<-cor(PathwayC1,ProbH1,method="spearman")
RhoH2<-cor(PathwayC2,ProbH2,method="spearman")
RhoH1
#> [1] 0.6622063
RhoH2
#> [1] 0.5464635

0.7 The R-CBN algorithm

The R-CBN algorithm [5] for quantifying pathway probability distributions is implemented in the PathProb_Quartet_RCBN function, whose input/output structure is similar to what we described above for the CT-CBN and H-CBN. However, there are multiple functions, which are called inside the PathProb_Quartet_RCBN function, and so the user do not need to directly work with (e.g. Path_Normalization, Pathway_Weighting_RCBN, EdgeMarginalized, Path_Edge_Mapper, and Poset_Weighting_RCBN).

The PathProb_Quartet_RCBN function also receives a four-column binary genotype matrix as the only input, and outputs the corresponding pathway probability distribution. Let’s quantify the pathway probability distributions before and after adding errors:

gMat2<-gMat[,2:5]
gMat2_mut<-gMat_mut[,2:5]

ProbR1<-PathProb_Quartet_RCBN(gMat2)
ProbR2<-PathProb_Quartet_RCBN(gMat2_mut)

In the figures below, you can compare the pathway probability distributions (quantified using R-CBN model) before and after errors:

visualize_probabilities(ProbR1)

visualize_probabilities(ProbR2)

Similar to what we described before, the Jensen-Shannon Divergence between the two distributions can be quantified as:

JSD_R<-Jensen_Shannon_Divergence(ProbR1,ProbR2)
JSD_R
#> [1] 0.05842678

You can see that the JDS value under R-CBN in this example (0.05), is considerably smaller than those of CT-CBN (0.41) and H-CBN (0.31).

The predictability values can also be compared as follows:

Pred_R1<-Predictability(ProbR1,4)
Pred_R2<-Predictability(ProbR2,4)
Pred_R1
#> [1] 0.5397152
Pred_R2
#> [1] 0.3825031
Pred_R1-Pred_R2
#> [1] 0.1572121

Finally, the Spearman’s correlation coefficient between the pathway compatibility and the pathway probabilities quantified under R-CBN can be quantified as follows:

RhoR1<-cor(PathwayC1,ProbR1,method="spearman")
RhoR2<-cor(PathwayC2,ProbR2,method="spearman")
RhoR1
#> [1] 0.957629
RhoR2
#> [1] 0.981945

0.8 The B-CBN method

The workflow for the B-CBN algorithm [4], which is implemented in the PathProb_Quartet_BCBN function, is also similar to that of R-CBN.

The PathProb_Quartet_BCBN function also receives a four-column binary genotype matrix as the only input, and outputs the corresponding pathway probability distribution. Again, let’s quantify the pathway probability distributions before and after the errors:

gMat2<-gMat[,2:5]
gMat2_mut<-gMat_mut[,2:5]

ProbB1<-PathProb_Quartet_BCBN(gMat2)
ProbB2<-PathProb_Quartet_BCBN(gMat2_mut)

In the figures below, you can compare the pathway probability distributions (quantified using B-CBN model) before and after errors:

visualize_probabilities(ProbB1)

visualize_probabilities(ProbB2)

Similar to what we described before, the Jensen-Shannon Divergence between the two distributions can be quantified as:

JSD_B<-Jensen_Shannon_Divergence(ProbB1,ProbB2)
JSD_B
#> [1] 0.07498265

The predictbility values can also be compared as follows:

Pred_B1<-Predictability(ProbB1,4)
Pred_B2<-Predictability(ProbB2,4)
Pred_B1
#> [1] 0.4632808
Pred_B2
#> [1] 0.264833
Pred_B1-Pred_B2
#> [1] 0.1984478

Finally, the Spearman’s correlation coefficient between the pathway compatibility and the pathway probabilities quantified under B-CBN can be quantified as follows:

RhoB1<-cor(PathwayC1,ProbB1,method="spearman")
RhoB2<-cor(PathwayC2,ProbB2,method="spearman")
RhoB1
#> [1] 0.9106342
RhoB2
#> [1] 0.9597566

0.9 Analysis of fitness landscapes

If we can establish a fitness landscape by directly assigning fitness to all potential genotypes, we can employ evolutionary models to compute the mutational pathway probabilities under the Strong-Selection Weak-Mutation (SSWM) assumption [7], which is implemented in the PathProb_SSWM function.

In case of 4 mutations, we will have 16 genotypes, so a fitness vector length of 16 is needed each element of which corresponds to a given genotype that can be determined by the generate_matrix_genotypes function. For example:

F<-c(0,0.1,0.2,0.1,0.2,0.4,0.3,0.2,0.2,0.1,0,0.6,0.4,0.3,0.2,1)
G<-generate_matrix_genotypes(4)

The 7-th genotype in the G vector is “1 0 1 0” and its corresponding fitness is F[7]=0.3.

The fitness landscape can be visualized as follows:

visualize_fitness_landscape(F)

The pathway probability distribution under the SSWM-based model can be quantified as:

Prob_W<-PathProb_SSWM(F,4)

Moreover, the pathway probabilities can be visualized as:

visualize_probabilities(Prob_W)

and finally the associated predictability can be quantified as:

Pred_W<-Predictability(Prob_W,4)

0.10 References

[1] Beerenwinkel, et al. Conjunctive Bayesian Networks. Bernoulli, 13(4):893–909, November 2007. ISSN 1350-7265. doi: https://doi.org/10.3150/07-BEJ6133.

[2] Beerenwinkel and Sullivant. Markov models for accumulating mutations. Biometrika, 96 (3):645–661, September 2009. ISSN 0006-3444, 1464-3510. doi: https://doi.org/10.1093/biomet/asp023.

[3] Gerstung, et al. Quantifying cancer progression with conjunctive Bayesian networks. Bioinformatics (Oxford, England), 25(21):2809–2815, November 2009. doi: https://doi.org/10.1093/bioinformatics/btp505.

[4] Sakoparnig and Beerenwinkel. Efficient sampling for Bayesian inference of conjunctive Bayesian networks. Bioinformatics, 28(18):2318–2324, September 2012. ISSN 1367-4811, 1367-4803. doi: https://doi.org/10.1093/bioinformatics/bts433.

[5] Hosseini. Robust inference of cancer progression pathways using Conjunctive Bayesian Networks, BioRxiv. July 2025. doi: https://doi.org/10.1101/2025.07.15.663924.

[6] Diaz-Uriarte and Herrera-Nieto. EvAM-Tools: tools for evolutionary accumulation and cancer progression models. Bioinformatics, 38(24): 5457–5459, December 2022. ISSN 1367-4803, 1367-4811. doi: https://doi.org/10.1093/bioinformatics/btac710.

[7] Hosseini, et al. Estimating the predictability of cancer evolution. Bioinformatics, 35 (14):i389–i397, July 2019. ISSN 1367-4803, 1367-4811. doi: https://doi.org/10.1093/bioinformatics/btz332.